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Abstract 

This paper introduces QCDLAB, a design and research tool for lattice QCD algorithms. The tool, a collection of MATLAB 
functions, is based on a "small-code" and a "minutes-run-time" algorithmic design philosophy. The present version uses the 
Schwinger model on the lattice, a great simplification, which shares many features and algorithms with lattice QCD. A typical 
computing project using QCDLAB is characterised by short codes, short run times, and the ability to make substantial changes in a 
few seconds. QCDLAB 1.0 can be downloaded from the QCDLAB project homepage http : / /phys . f shn . edu . al/qcdlab . ht: 
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1 Advancement of Lattice QCD 



Lattice QCD, an industrial-range computing project, is in its fourth decade. It has basically two major computing problems: sim- 
ulation of QCD path integral and calculation of quark propagators. Generally, these problems lead to very intensive computations 
and require high-end computing platforms. 

However, we wish to make a clear distinction between lattice QCD test and production codes. This is very important in 
order to develop a compact and easily managable computing project. While this is obvious in theory, it is less so in lattice QCD 
practicing: those who write lattice codes are focused primary on writing production codes. What is usually called test code is 
merely a test of production codes. 

The code of a small project is usually small, runs fast, it is easy to access, edit and debug. Can we achieve these features for 
a lattice test code? Or, can we modify the goals of the lattice project in order to get such features? In our opinion, this is possible 
for a minimal test code, a test code constisting of a minimal possible code which is able to test gross features of the theory and 
algorithms at shortest possible time and largest acceptable errors on a standard computing platform. This statement needs more 
explanation: 

a. Although it is hard to give sharp constraints on the number of Unes of the test code, we would call "minimal" that code 
which is no more than a few printed pages. 

b. The run time depends on computing platforms and algorithms, and the choice of lattice action and parameters. It looks 
hke a great number of degrees of freedom here, but in fact there are hardly good choices in order to reduce the run time of 
a test code without giving up certain features of the theory. Again, it is tremendously difficult to give run times. However, 
a "short" run time should not exceed a few minutes of wall-clock time. 

c. We consider a computing platform as being "standard" if its cost is not too high for an academic computing project. 

d. We call simulation errors to be the "largest acceptable" if we can distinguish clearly signal form noise and when gross 
features of the theory are not compromised by various approximations or choices. 

e. Approximations should not alter basic features of the theory. The quenched approximation, for example, should not be 
considered as an acceptable approximation when studying QCD with light quarks. 

A test code with these characteristics should signal the rapid advance in the field, in which case, precision lattice computations 
are likely to happen in many places around the world. Writing a minimal test code is a challenge of three smarts: smart computers, 
smart languages and smart algorithms. 

In this paper we introduce the first version of QCDLAB, QCDLAB 1 .0, a collection of MATLAB functions for the simulation 
of lattice Schwinger model. This is part of a larger project for algorithmic development in lattice QCD. It can be used as a small 
laboratory to test and validate algorithms. In particular, QCDLAB 1.0 serves as an illustration of the minimal test code concept. 

QCDLAB can also be used for newcomers in the field. They can learn and practice lattice projects which are based on short 
codes and run times. This offers a "learning by doing" method, perhaps a quickest route into answers of many unknown practical 
questions concerning lattice QCD simulations. 

The next two sections describe basic algorithms for simulation of lattice QCD and foundations of Krylov subspace methods. 
Then, we present the QCDLAB 1.0 functions followed by examples of simple computing projects. The last section outUnes the 
future plans of the QCDLAB project. 
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2 Simulation of lattice QCD 



Notations and Problem definition 

Lattice gague fields U/^j, our basic degrees of freedom, are defined on oriented links i i + ji of a four dimensional hypercubic 
lattice with N sites and lattice spacing a. Here, i is a four component index labeling the lattice sites, /j. = 1,2,3,4 labels 
dierctions in the Euclidean space, and fi is the unit vector along /^-direction. Algebraically, a lattice gauge field is an order 3, 
complex valued unitary matrix with determinant one, an element of the SU (3) colour group. 

The basic computational task in lattice QCD is the generation of ensembles of guage field configurations according to prob- 
ability density: 

Pqcd{U) ~ det{D*D) e'^'^^) , 

where D is the lattice Dirac operator, 

is the gauge action, /? = is the gauge-boson coupUng constant, and 



p 



is the plaquette in plane, a 4-link product defined as in the figure. 
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We have assumed here a fermion theory with two degenerate flavours of quark masses which suffices for the purpuse of 
this paper. There are two main formulations of lattice fermions: Wilson [1] and Kogut-Susskind [2], called also staggered, 
discretizations of the Dirac operator. Wilson operator. Unking sites i and j, is given by 

Dij = (m + -)/4 (g) I3 6ij - i y'[(/4 - 7m) ® U^^i5ij+jx + (J4 + 7^) ® U^^i_(,5i^j_jx] , 

ft 

whereas Kogut-Susskind operator is given by 

Dij = mis 6ij + i ^(-l)^^+-+'''-(C/^,,^i,,+0 - U^,i.f,5ij.ii) . 



Here m is the bare quark mass, 7^ are anticommuting Hermitian gamma-matrices acting on Dirac space 

7At7i^ + 7;/ 7/1 = 2(5^,./4 , 

and (gi denotes the direct or Kronecker product of matrices. Hence, Wilson and Kogut-Susskind operators are complex valued 
matrices of order 12N and 3A'' with 49A'^ and 25N nonzero elements respectively. 

Note that the difficulty of handling the determinat of a huge matrix can be softened using the Gaussian integral expression: 



,. N 



(2.1) 



i=l 



where cj) is a complex valued field, a pesudofermion field. Thus, the detemiinant is traded for the inversion. All we need now is 
to generate ensembles according to the new density: 

Pqcd{UA) - exp{-5g([/) - . 
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Hybrid Monte Carlo Algorithm 

The HMC algorithm [3] starts by introducing su{3) conjugate momenta P to SU (3) lattice gauge fields. Hence, the classical 
Hamiltonian can be written in the form 



H{P,U,cP) = -uP' + Sg{U)+r{DD*)-\ 



whereas expanded probabiUty density is 

pqcd{p,u,<p) eMnp,u,m ■ 

The idea of the HMC algorithm is as follows: 

i) Use global heatbath for pseudofermion field update. If C is a Gaussian pseudofermion field, the new field is updated 
according to the equation 

<^ = DC. 

ii) Integrate numerically classical equations of motion. 

iii) Correct numerical integration error using Metropolis et al algorithm. 

Classical Equations of Motion 

The first first equation of motion is defined using conjugate momenta: 

IT =iP U ■ 

For the second equation one writes down the total derivative of the Hamiltonian: 

n = ^trP^,iP^,i + Sg- r{DD*)-^DD*{DD*)-'cl) + h.c. , 

Substituting for Un^i the first equation of motion and using 

n = o, 

one finds the second equation of motion 

P — F 

From H expression, it is clear that in order to evaluate the force F^^j, one must calculate {DD*)~^<j). 
Leapfrog Algorithm 

The widely used algorithm for solving equations of motions is the leapfrog algorithm 

O - P +F ^ 

— At)^ "i" 2 
TJ' . = JQfi,i^tTT . 

P'lJ,,i ~ Qix,i + Ffj,,i ) 

where the primed fields are those advanced by At and Q^^i are half-step momenta. The algorithm starts at t = 0, where momenta 
are taken to be Gaussian noise. Then it continues up to time t = r for Njoic = r/At number of steps. 

It is easy to show that this scheme is reversible and preserves infinitesimal area of the phase space. Reversibility guarantees 
detailed balance of HMC, whereas area preservation ensures that there are no corrections due to integration measure. However, 
Hamiltonian is not conserved since 

H'-H= ^HAt^ + 0{At^) . 
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Metropolis et al Algorithm 

The HMC algorithm ends up accepting or rejecting the proposed gauge field {C^/t,i(T)} using Metropolis et al algorithm [4]. The 
acceptance probability for this algorithm is 

Pacc({P(0),C/(0)} ^ {P{r),U{T)}) = min{l,e^M-^(0)} 

On rejection, one goes back to time t = and refreshes momenta. 



Inversion Algorithms 

Calculation of forces and quark propagators requires the solution of hnear systems 

Dx = b, 

where D is a the lattice Dirac operator and b the right hand side. As we noted earlier, _D is a large and sparse matrix. For 
these matrices, Krylov subspace methods provide the most efficient inversion algorithms [5]. Such an algorithm is the Conjugate 
Gradients (CG) algorithm [6]. 



Conjugate Gradients 

Given an approximation xq, the algorithm starts with computation of the residual vector tq, 

ro = 6 - Axq , 

and initialisation of a vector pq to the starting residual, po = vq. Note that CG assumes that A is a positive definite and Hermitian 
matrix. Then, the algorithm iterates these vectors using recursions 

Xk+l = Xk+ OikPk 
Tk+l = Tk- Q-kApk 

Pk+1 = rk+1 + Pk+iPk , 



where 



Conjugate Gradients on Normal Equations 

Since the lattice Dirac operator is neither positive definite nor Hermitian one takes A = D*D and solves for the normal equations 

D*Dx = D*b . 

Then, we get what we call the Conjugate Gradients algorithm on Normal Equations (CGNE). As with standard CG, given an 
approximation xo, the algorithm computes starting residual ro, 

ro=b- Dxo , 

and initialises po to tq. Additionally, a new vector sq is initialised using sq = D*ro. The CGNE algorithm recursions are 

Xk+i = Xk + akPk 
rk+i = Tk- akDpk 
Pk+l = Sk+l + Pk+lPk , 

where 

Sfe+i = D*rk+i , 



and 



{opknopkY sisk 
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A Convergence Result 

It can be theoretically proven that CGNE algorithm converges linearly, i.e. 

where || . || denotes the Euclidean norm and k{D) is the condition number of the Dirac operator. In terms of singluar values of D, 
ci > C72 > • • • > (Tjv, the condition number is given by 

If D is rescaled such that ui = 1 we get (see [5], p52) 

It can be shown that CGNE calculates a solution of the minimisation problem 

min 116 — Dx\\ . 

But for non-normal matrices, such as the Wilson-Dirac operator, this solution is sub-optimal. Unlike CGNE, the General Min- 
imised Residual (GMRES) algorithm calculates the optimal solution of this problem. Both methods convegre according to the 
law described above. However, (t,v is traded for the spectral gap, which is larger (for non-normal matrices). As we wiU show 
later, GMRES algorithm is more expensive and often prohibitive in terms of computer resources. 

3 Foundations of Krylov Subspace Methods 

A Krylov subspace is the space built from the pair (ro, D): 

ICk = span{ro, Drg, . . . , Vq} , 

where 1 < k < m, with m being the rank of D. The simpliest example consists of the pair ro and the identity matrix I, in 
which case k — 1 and the Krylov subspace is simply the vector tq. In this case we have to do with an invariant subspace: further 
multiplications of ro by D = / will not increase the subspace. 

Iterative methods which seek solutions x in ICk are called Krylov subspace methods. If Qk = [qi, . . . ,qk] is & basis of 
othonormal vectors of ICk, the approximate solution can be written as 

k 

Xk=xo + '^ ViQi 

i=l 

In order to compute it one has to compute first qi,i = 1,2, ... ,k. There are two general approaches to compute y, namely 
i) Galerkin approach: choose y such that the residual vector r^ is orthogonal to fCk- 
h) Minimal residual approach: choose y such that ||rfe || is minimal. 

Basis generation: Amoldi Algorithm 

The method is a modified Gram-Schmidt orthogonalisation of ICk, in which the next vector is computed by 

fc 

Qk+i = Dqk - ^ qjhjk , 

and where the coefficients hjk are chosen such that the vectors come mutually orthonormal: 

q*qk+i = 0. 

From this condition we get 

f^jk = QjDqk ■ 

The algorithm that facihtates this process is called Arnoldi algorithm [7]. Having basis vectors one can construct two linear 
solvers, which are described in the following. 
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Algorithm 1 Arnoldi algorithm 

P=\\ro\\ 
qi = ro/p 
for fc = 1, . . . do 
w = Dqk 

for J = 1, . . . , fc do 

hkj = QjW 

w := w — Qjhjk 
end for 
hk+i,k = \\w\\ 
if hk+i,k = tlien 

stop 
end if 

Qk+i = w/hk+i,k 
end for 



FOM: Full Orthogonalisation Method 

If we denote Hk the matrix with elements hij,i,j = 1, . . . , fc, the result of Arnoldi decomposition can be written in matrix form: 

DQk = QkHk + hk+i,kQk+iek = Qk+iHk . 
Approximate solution can also be written as 

Xk=xo + QkVk ■ 

For the residual error vector one can write: 

Tk = h- Dxk 

= To - DQkVk 

= QiP - QkHkVk - /ife+i,fe9fe+iefe y/s • 
The Galerkin approach requires the next residual to be orthonormal to all previous vectors 

Qlrk = , 

which is 

<5fc9iP - QkQkHkVk - QkQk+ihk+i,kekyk = . 

Using orthonormaUty of Qfe, 

Q*kQi = ei, 
QlQk = h, 
Qk1k+1 = , 

one obtains the Unear system 

HkVk = dip ■ 

Note that Hk is an upper Hessenberg matrix and that the size of the problem depends on the value of fc, which is usually a much 
smaller than N, the order of the original problem. 

GMRES: Generalised Minimal Residual Method 

Arnoldi recurrences can be written also in the form: 

DQk = QkHk + hk+i,kqk+iek = Qk+iHk , 

where now Hk is an upper Hessenberg (fc + 1) x fc matrix, or the matrix Hk appended by the row hk+i^k^k- ^i*^ notations 
the residual error vector can be written as 

rk = qiP - Qk+iHkUk ■ 

1 



The minimal residual strategy of GMRES [8] requires that 

\\b - DxkW ^ min, Xk € ICk 
Substituting Xk = QkVk and qi = Qk+iei we get 

<3fe+i(eip - HkVk) 

Since Qfe+i is orthonormal, it can be ignored and we get the smaller least squares problem: 



eip - HkVk 



min, yfc G C . 



Krylov solvers are Polynomial Approximation solvers 

In fact, the approximate solution in the Krylov subspace 

XkeK,k = span{ro, Dro, D''~^ro} , 

can be described as a degree k — 1 polynomial applied to tq: 

Xk = a;o + Pk-i{D)ro . 

Then, the residual vector is a degree k polynomial apphed to ro'- 

Tk = ro - DPk-i{D)ro 
= [I-DPk-i{D)]ro 
= Rk{D)ro. 

Hence, GMRES solves the constrained minimisation problem: find the polynomial Rk such that 

\\Rk{D)ro\\ ^ min , i?fe(0) = 1 . 
In fact, this is a characterisation of Krylov subspace methods. One speaks of optimal polynomials generated in this way. 

But ... 

... GMRES requires to store all Amoldi vectors and its work grows proportionally to fc^. One can limit this growth of resources 
by restarting the algorithm after a given number of steps. Using this strategy, robustness is lost and sometimes convergence as 
well. Going back to normal equations, 

D*Dx = D*b , 

we know that the optimal polynomial is computed for D*D and not for D itself. However, computing resources remain constant 
in this case, an important advantage over GMRES. Therefore, a great deal of research has been devoted to methods which are as 
cheap as CGNE and yet have similar convergence to GMRES. 

BiCG75 

One of these methods is the specialisation of the Biconjugate Gradients (BiCG) algorithm in the case of the Wilson operator, 
which is 75-Hermitian (see [5] p47): 

D* = 75^75 . 

The method, coined BiCG75, can be formally obtained from CG by inserting a 75 operator whenever a scalar product occurs: 

U*V > W*75f • 

Since, 75 is a nondefinite operator, this scalar product may not exist, and a premature breakdown may occur. In practice we see 
an irregular behaviour of the residual vector norm history. 

BiCGStab 

Biconjugate Grandients Stabilised algorithm, or BiCGStab [9] replaces the redundant recursion of BiCG for a local noinimiser of 
the residual vector norm, thus giving a general and robust solver for non-Hermitian systems. 



8 



4 QCDLAB 1.0 



QCDLAB is designed to be a high level language interface for lattice QCD computational procedures. It is based on the MATLAB 
and OCTAVE language and environment. While MATLAB is a product of The Math Works, OCTAVE is its clone, a free software 
under the terms of the GNU General Public License. 

MATLAB/OCTAVE is a technical computing environment integrating numerical computation and graphics in one place, 
where problems and solutions look very similar and sometimes almost the same as they are written mathematically. Main 
features of MATLAB/OCTAVE are: 

• Vast Build-in mathematical and Unear algebra functions. 

• Many functions form Bias, Lapack, Minpack, etc. libraries. 

• State-of-the-art algorithms. 

• Interpreted language. 

• Dynamically loaded modules from other languages Uke C/C++, FORTRAN. 

• Ability to compile OCTAVE codes using the Startego Octave Compiler, Octave-Compiler . org. 

Hence, QCDLAB offers a two level language system: a higher level language, which is very popular for numerical work and 
a lower level translation to C++. In fact, if required the lower level can be further optimised for the particular hardware in place. 

The first version of QCDLAB is intended for work on the higher level only. QCDLAB 1.0 contains the following MAT- 
LAB/OCTAVE functions: 



Autocorel BiCGgS BiCGstab Binning cdotS 

CG CGNE Dirac_KS Dirac_r Dirac_W 

FOM Force_KS Force_W GMRES HMC_KS 

HMC_W Lanczos SCG SUMR wloop 



We divide them in two groups: simulation and inversion algorithms. We begin below with the description of simulation 
algorithms. 



4.1 Simulation Algorithms 

This section introduces QCDLAB 1.0 simulation tools of lattice QED2. In this case, lattice gauge fields U^^i can be expressed 
using angles O^i gM., 

whereas gauge action is given by 
where (3 = 1/e^, and electron charge e . 



Dirac Operators 

In case of Wilson fermions the Dirac operator is given by 

4 1 

Dij = (m + -)/2 Sij - - 2,[(/2 - o^)U^^i6i^j+jj, + {h + a^)U^^i-ij,5i^j-fi\ , 

(X A 

with being Pauli matrices. 




QCDLAB defines spin projection operators in terms of these matrices 
They are computed using this code: 
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% Form spin projection operators 

PI. plus = [l,l;l,l]/2; Pl.minus = [ 1 , - 1 ; - 1 , 1 ]/2 ; 
P2.plus = ;i ,l]/2; P2.minus = [ 1 , i ;-i , 1 ] /2 ; 

Given the quark mass, mass, the number of lattice sites along each direction, N, the total number of lattice sites, N2, the 
gauge field configuaration, Ul, and spin projector operators as input, Dirac_W returns Wilson matrix, Al: 

Al=Dirac_W(mass ,N,N2,U1, Pl.plus ,P2_plus ,Pl.minus ,P2_minus); 

For staggered fermions we have 

I 2 

where 

ei,i = 1, e2,i = (-1)'^ . 

Dirac-KS below returns the staggered matrix: 
Al=Dirac.KS(mass ,N,N2,U1); 

Forces 

In order to compute the force it is convenient to have ready a nearest neighbours Ust for each lattice site. The code that implement 
forward kp and backward km lists is given below. 

% Make nearest neighbours list 
for j2=0:N-l; 
for jl=0:N-l; 

k = 1 + jl + j2*N; 

kp(k,l) = 1 + mod(jl+l ,N) + j2*N; 

kp(k,2) = 1 + jl + mod(j2+l ,N)*N; 

km(k,l) = 1 + mod(jl -1+N,N) + j2*N; 

km(k,2) = 1 + jl + inod(j2-l+N,N)*N; 

end 
end 

In case of two degenerate Wilson fermions, the force is given by 

F^,, = -/3sin(^^,i + - e^^i+o - e.,i) + 2Re iU;^AxUKm + vt+f^V^Xi) , 

where pseudofermion fields are two-component complex valued vectors. Force_W returns both gauge pg and fermion pf 
pieces. Its arguments are: number of lattice sites, N2, forward neighbour list, kp, backward neighbour fist, km, angles, thetal, 
pseudofermion fields, eta, chi, gauge field, Ul, and spin projector operators. 

[ pf , pg ] = Force_W (N2 , kp ,km, thetal ,eta ,chi ,Ul,Pl_plus ,P2_plus , PI .minus , P 2 .minus ) ; 
In case of four degenerate Kogut-Susskind fermions the force is 

Fij.,i = -/3sin(6'^,i + e^,i+(i - O^^i+t, - 6»^,i) + 2Re iU*^,i{eti,i+ixXi+iim - efiAVi+fiXi) , 

where pseudofermion fields are complex values numbers. As in Wilson case, Force_KS returns gauge pg and fermion pf 
forces. 

[pf ,pg] = Force.KS (N2,kp ,km, thetal , eta , chi ,U1 ) ; 

Simulation Tools 

QCDLAB's simulation tools are HMC.W and HMC.KS. Their arguments are: 

iconf : set to zero for hot start, 
thetal: starting angles. 
On completion they return: 
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A2: Dirac operator on theta2 background, 
Plaq: plaquette history, 
Q_top: topological charge history, 
Wloop: ten smallest Wilson loops history, 
theta2: output angles, 

Stat: a four column array of Metropolis test history. 
One trajectory of Hybrid Monte Carlo algorithm consists of steps given in Algorithm 2. In case of a successful test, the 

Algorithm 2 HMC Trajectory in QCDLAB 

• Heatbath update of pseudofermion fields phi=Al*etaO; 

• Heatbath update of momenta P=randn(N2,2) ; 

• Compute Hamiltonian HI. 

• Invert Dirac operator chi=Al'\etaO; 

• Advance momenta half step P=P+pdot*deltat/2; 

• Start molecular dynamics loop. 

- Advance angles full step theta2=theta2+P*deltat ; 

- Compute inversions eta=A2 \phi ; chi=A2 ' \eta; 

- Advance momenta full step P=P+pdot*deltat; 

• Advance angles full step theta2=theta2+P*deltat ; 

• Compute inversions eta=A2 \phi ; chi=A2'\eta; 

• Advance momenta half step P=P+pdot*deltat/2; 

• Compute Hamiltonian H2 . 

• Perform Metropolis test. 

function computes topological charge, 

<3top = ^ sin(^^,i + dv,i+p, — 6fj,^i+v — 9v,i) , 

and n X n Wilson loops, 

/n— 1 n—1 n—1 n — 1 \ 

Wn = — COS ( ^/i,i+fe/i + ^L/,i+n(a+fe!> — 9n,i+nu+k(i ~ 9u,i+kO j • 

i,IJ.<L' \k=0 fc=0 fe=0 fe=0 / 

HMC_W and HMC_KS functions are called in the form given below. 

[A2,Plaq, Q_top , Wloop, theta2 , stat] = HMCLW( theta 1 , iconf ) ; 
[A2,Plaq, Q-top , Wloop, theta2 , stat] = HMCjCS( thetal , iconf ) ; 

Computation of Wilson Loops 

wloop returns ten smallest Wilsonloops. Angles 9^,i-,0^,i+ix, ■ ■ ■ ,^/a,i+9/i are implemented using arraystlegl,tleg2...tlogl0. 
These are summed over to give arrays wlegl,wleg2...wlogl 0. Then, each of these arrays is used to compute the correspond- 
ing Wilson loop around a square. The function is called using: 

wlp = wloop (N,N2, kp , km, thetal ) ; 
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Autocorrelations and Errors 



When measuring an observable, such as plaquette, we get time series of data in the form Oi,i = 1, 2, . . . , n. Standard error 
estimation procedures rely on the assumption that data are decorrelated. This can be checked by measuring the autocorrelation 
function 

A{t) = cov(O(0)O(i)) ~ e"*/*"" , 

where texp is called exponential autocorrelation time. Autocorel, which implements A{t), has two arguments: the data vector, 
X, and the maximal time interval, t : 

y= Autocorel (x , t ) ; 

In fact, a quick way to estimate the error is to block or 'bin' data. In this case one computes block averages and estimates the 
error of averages for increasing block size. Binning does exactly that. One must specify the original data, x, and the maximal 
block size, t. It returns a t-element vector err, containing error estimates for block sizes 1,2,. . .,t: 

err=Binning (x , t ) ; 

A simple recipe is to take t ~ fexp and choose the maximum value of err. 



Computing Projects 

The best way to test QCDLAB capabilities is to set up a simple computing project like the following: Compute square Wilson 
loops and topological charge using HMC_KS. Graph Wilson loops as a function of the linear size. Do you get a perimeter law? 
Plot the histogram of the topological charge. How is it distributed? 

As usuall, one opens two windows, one running MATLAB/OCTAVE, and one text editor where HMC_KS . m file is located. 
We present here an exmaple of miming OCTAVE with model and algorithmic parameters as in the Usting. Entering 

[A2,Plaq , Q.top , Wloop , theta2 , s ta t ]=HMC_KS ( [ ] ,0); 

one gets an output stream that looks something like: 

ans = 

3.00000 0.33333 0.57985 -1.25858 
ans = 

16.00000 0.12500 0.76736 -0.92516 
ans = 

20.00000 0.15000 0.75921 -0.27965 

We have displayed here only the first three lines. Columns of the stream display trajectory number, acceptance, plaquette and 
topological charge. One can store the angle output, theta2, and feed it into the next run: 

thetal=theta2 ; 

[A2,Plaq , Q.top , Wloop , theta2 , stat ]=HMCJCS( thetal ,1); 

In our example project we ran seven batches of HMC_KS and analysed results of the last batch. We computed and plotted 
autocorelation functions of five Wilson loops: 

auto = Autocorel (Wloop ( : , 1 ) , 20) ; 

semilogy ( auto ) 

hold 

auto = Autocorel(Wloop(: ,2) ,20); 
semilogy ( auto ) 

auto = Autocorel(Wloop(: ,3) ,20); 
semilogy ( auto ) 

auto = Autocorel(Wloop(: ,4) ,20); 
semilogy ( auto ) 

auto = Autocorel (Wloop ( : ,5) ,20); 

semilogy ( auto ) 

xlabel ( ' t ' ) 

ylabel ( ' Autocorel ' ) 

replot 

gset terminal postscript 
gset out ' Auto . ps ' 
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replot 

gset terminal xll 
hold 

1 



o 0-1 

o 
o 
o 

< 



0.01 



123456789 10 

t 

Then we computed central values of Wilson loops: 
w=mean ( Wloop ) ; 

The errors are estimated using Binning with the largest block size set to 10: 

sw=max( Binning (Wloop ( : ,1) ,10)); 
sw = [sw , max( Binning (Wloop (: ,2) ,10))]; 
sw = [sw ,max( Binning (Wloop (: ,3) ,10))]; 
sw = [sw ,max( Binning (Wloop ( : ,4) ,10))]; 
sw = [sw , max( Binning (Wloop (: ,5) ,10))]; 
sw = [sw ,max( Binning (Wloop (: ,6) ,10))]; 
sw = [sw ,max( Binning (Wloop ( : ,7) ,10))]; 
sw = [sw , max( Binning (Wloop (: ,8) ,10))]; 
sw = [sw ,max( Binning (Wloop (: ,9) ,10))]; 
sw = [sw ,max( Binning (Wloop (: ,10) ,10))]; 

Then, to plot Wilson loops we entered: 

semilogyerr(w(l:5) ,sw(l:5)) 
hold 

axis(L0,6,le-4,l]) 
semilogy (w( 1:5)) 
axis([0,6,le-3,l]) 
xlabel('Linear„Size') 
ylabel ( 'Wloop' ) 
replot 

gset terminal postscript 
gset out 'Wloop. ps' 
replot 

gset terminal xll 
hold 
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Finally, to produce the histogram plot of topological charge is very easy: 

hist (Q_top ,30) 
xlabel ( ' Q-top ' ) 
ylabel ( ' Frequency ' ) 
gset terminal postscript 
gset out 'Q_top.ps' 
replot 

gset terminal xll 
hold 
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4.2 Inversion Algorithms 



In this section we list QCDLAB functions which implement basic Krylov subspace inverters for use in lattice guage theories. In 
this version the whole matrix A, be it sparse or dense, should be supplied as an argument, together with the right hand side b, 
the approximate soluction x_0 , the tolerance t o 1 and the maximum number of iterations nmax. CG . m and CGNE . m functions 
below return solution x and recursive residual vector norm history rr. 

[x,rr] = CG(A, b , xO , tol , nmax) ; 
[x,rr] = CGNE(A,b,xO , tol , nmax); 

FOM . m and GMRES . m return true residual instead of recursive residual and Amoldi matrix H as well. This can be used to 
compute approximate egienvalues of the original matrix A. 

[z,rt,H] = FOM(A,b,zO , tol , nmax); 
[z,rt,H] = GMRES(A,b,zO , tol , nmax); 

BiCG75 and BiCGstab 

The BiCGgS function may be used for 75-Hermitian operators: 
[x,rr] = BiCGg5(A,b ,xO , tol ,nmax) ; 

It uses a special inner product, a pseudo-scalar which does not lead to a vector norm. Hence, the algorithm can break down 
prematurely. There are look ahead strategies which cure this problem. We don't employ them. However, to avoid a starting 
breakdown the recipe is to use a non-trivial initial solution. 

BiCGstab is a general non-Hermitian solver. It inherits from BiCG the premature breakdown problem. In order to avoid a 
starting breakdown we use a random initial left Lanczos vector y . 

[z , rr ] = BiCGstab(A,b, zO , tol ,nmax); 

Symmetric Lanczos algorithm 

The counterpart of Amoldi algorithm for Hermitian matrices is Lanczos algorithm [10]. From Amoldi algorithm we have 

Hk = QlAQk . 

If A is Hermitian we can write 

= Q*k^*Qk = Qk^Qk = Hk ■ 

Since Hk is upper Hessenberg and Hermitian, it must be tridiagonal. It is cormnonly denoted by T^. This way, after k Lanczos 
steps we have 

-^Qk = QkTk + PkQk+iek , 

where 

/ai Pi \ 

Tk= 

■■■ ■■■ Pk-i 

\ Pk-l Oik ) 

QCDLAB function Lanczos arguments are matrix A, starting vector b and maximal number of steps nmax. It retums 
Lanczos vectors Q and matrix T: 

[Q,T] = Lanczos (A,b , nmax ) ; 

Computing Projects 

In this section we illustrate QCDLAB inverter functions for Wilson fermions. The matrix A, a function argument, is an output 
of HMC_W. Having an angle configuration, one can generate it using Dirac_r function: given the quark mass mass, the Wilson 
parameter r, the lattice size N, the total number of lattice sites N2, and the angle configuration thetal as arguments, it returns 
the Wilson-Dirac matrix Al: 

Al=Dirac_r (mass , r ,N,N2, thetal ) ; 

Note that knowing A is not essential. Indeed, any user supplied procedure of matrix-vector multiplication can be called 
whenever =A* occurs in the function. Future releases of QCDLAB will provide capabihties that implement this. 
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Wilson Fermions 

In our example project here, we used an angle configuration on a 16 x 16 lattice. We loaded this configuration and created a right 
hand side and a starting solution. Then we generated three WUson-Dirac matrices of three different fermion masses: 

load thetal6 

N=16;N2=N"2;b=zeros(2*N2, l);xO=b;b(l)=l; 
Al=Dirac_r( -0.1 ,1 ,N,N2, theta2 ); 
A2=Dirac.r ( -0.05,1 ,N,N2, theta2 ); 
A3=Dirac.r(0,l ,N,N2, theta2 ); 

A first thing to do is to compute and plot eigenvalues of the massless operator: 

e3 = eig(A3); 
plot (e3 , ' o ' ) 
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In order to compare GMRES and CGNE convergence one calls respective solvers for each fermion mass as follows: 

tol=le — 13;nmax=2*N2; 

[x.gmresl ,r_gmresl ,H1] = GMRES( Al , b , xO , tol ,nmax) 
[x_gmres2 , r_gmres2 ,H2] = GMRES( A2, b , xO , tol ,nmax) 
[x_gmres3 , r_gmres3 ,H3] = GMRES( A3 , b , xO , tol ,nmax) 
[x.cgnel ,r_cgnel] = CGNE( Al , b , xO , tol ,nmax); 
[x_cgne2,r_cgne2] = CGNE( A2 , b , xO , tol , nmax ) ; 
[x_cgne3,r_cgne3] = CGNE( A3 , b , xO , tol , nmax ) ; 

Then one plots the residual norm history as a function of matrix- vector multiplications calls: 

k_gmres 1 =max( size (r_gmresl )) 
k_gmres2=max( size ( r_gmres2 ) ) 
k_gmres3=max( size (r_gmres3 )) 
k_cgne 1 =2*max( size(r_cgnel )) 
k_cgne2=2*max( size(r_cgne2)) 
k_cgne3 =2*max( size (r_cgne3 )) 
semilogy ( 1 : k.gmresl , r_gmresl , ' ; gmresl ; ' ) 
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hold 

semilogy ( 1 : k_gmres2 , r_gmres2 , ' ; gmres2 ; ' ) 
semilogy ( 1 : k_gmres3 , r_gmres3 , ' ; gmres3 ; ' ) 
semilogy (l:2:k_cgnel ,r_cgnel , ' ;cgnel ; ') 
semilogy (l:2:k_cgne2 ,r_cgne2 , ' ;cgne2; ') 
semilogy(l:2:k_cgne3 ,r_cgne3 , ' ; cgne3 ; ' ) 
xlabel('#^matrix— vector') 
ylabel ( ' residual „norm' ) 
replot 

gset terminal postscript 
gset out 'conv_hist.ps' 
replot 

gset terminal xll 
hold 

Note that GMRES has an additional overhead which grows like i^, rendering the algorithm useless for large i. Hence, the 
GMRES convergence, measured in terms of matrix-multiplication number, should be considered as a theoretically ideal result 
and a benchmark for the performance of short-recurrences algorithms. 
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The difficulty of GMRES exploding resources can be avoided using BiCGgS and BiCGstab functions. Employing the 
lightest mass one can compare convergence history of all solvers: 

xO=rand(2*N2, 1 ); 

[ x_gmres 1 , r_gmres 1 ,H1 ] = GMRES(A1 , b , xO , tol , nmax ) ; 

[x.cgnel ,r_cgnel] = CGNE(A1 , b , xO , tol ,nmax); 

[ x.bicg 1 , r_bicg 1 ] = BiCGgS (Al , b , xO , tol , nmax ) ; 

[x.bicgstabl , r.bicgstab 1 ] = BiCGstab (Al , b , xO , tol ,nmax); 

k_gmresl=max( size (r.gmresl )) ; 

k_cgne 1 =2*max( size(r_cgnel)); 

k_bicg l=max( size ( r_bicg 1 )); 

k.bicgstabl =2*max( size(r_bicgstabl )); 

semilogy(l: k_gmresl , r_gmresl , ' ; gmresi ; ' ) 

hold 
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semilogy (l:2:k_cgnel ,r_cgnel , ' ;cgnel ; ') 
semilogy (l:k_bicgl ,r_bicgl ,';bicgl;') 
semilogy ( 1 : 2 : k.bicgstab 1 ,r_bicgstabl ,'; 
xlabel('#^matrix— vector') 
ylabel ( ' residual ^norin' ) 
repl ot 

gset terminal postscript 

gset out 'all_solver_conv_hist.ps' 

replot 

gset terminal xll 
hold 
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As seen from the plot, the best short-recurrences solver (for this particular example) is BiCGgS. It converges at about the 
same matrix-vector multiplications as GMRES and yet avoiding its pitfalls. Its irregular convergence history can be softened to 
the level of BiCGstab using the quasiminimal residual approach, or QMR algorithm, which is not described here. 



Staggered Fermions 

Staggered operator is anti-Hermitian as can be illustrated below: first create the staggered matrix; then typing norm (A-i-A' ) , 
the answer will be zero. Since i *A is Hermitian, its eigenvalues will be real. We plotted the eigenvalues as sorted from the eig 
function. 

Ul=cos ( theta2)H- sqrt(-l)*sin ( theta2 ) ; 

N=16;N2=N"2; 

A=Dirac.KS (0 ,N,N2,U1); 

norm(AH-A' ) 

ea=eig ( i *A) ; 

plot ( ea , ' o ; ea ; ' ) 

Theory tells that for normal matrices, such as staggered operator, CGNE is an optimal solver. The following plot compares 
GMRES and CGNE convergence history as a function of matrix- vector multiplications counter. 



b=zeros(N2, l);xO=b;b(l)=l; 
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tol=le-13;nmax=N2; 

[x.gmres , r.gmres ,H| = GMRES(A, b , xO , tol , nmax ) ; 
[x_cgne,r_cgne] = CGNE( A, b , xO , tol , nmax ) ; 
k_cgne=max( size(r_cgne)); 
semilogy ( r.gmres , ' ;gmres; ') 
hold 

seinilogy(l:2:2*k_cgne ,r_cgne , ' ;cgne; ') 
xlabel('#^matrix— vector') 
ylabel(' residual ^norm ' ) 
replot 

gset terminal postscript 
gset out 'conv_hist_ks.ps' 
replot 

gset terminal xll 
hold 
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CG-Lanczos Equivalence 

As another application, one can use Lanczos algorithm to solve the linear system 

D*Dx = D*b 

for Wilson matrix D. Taking the lightest mass from the previous exmaple and setting maximal iteration number to k.cgneS: 
[Q,T] = Lanczos (A3' *A3 ,A3' *b , k_cgne3 ) ; 
one can construct the solution using: 

X .lanczos 3 =Q( : , 1 : k_cgne3 ) * (T\ [norm (A3' *b);zeros( k_cgne3 — 1 , 1 )]) ; 

Comparison to x_cgne3 yields: 

norm( x_cgne3— x_lanczos3 ) 
ans = 2.5537e-14 

This example illustrates the theoretical result that CG and Lanczos algorithms are equivalent linear solvers in exact arithmetic. 
4.3 Ginsparg-Wilson Fermions 

In this section we describe QCDLAB functions for use with lattice chiral fermions. A chiral lattice Dirac operator satisfies the 
Gisnparg- Wilson relation [11]: 

One solution to this relation is the Nueberger overlap operator [12], 

1 + m 1 -m 

where sgn{.) is the signum function, Hw = ^?>Dw- For the signum function to be nontrivial, the Wilson-Dirac operator should 
be indefinite, which is the case if its bare mass M is sufficiently negative. This is usually taken to be in the interval (—2, 0). 
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Another form of the Neuberger operator is 

If we express Dw in terms of its singular values and vectors, 

Dw = UEV* , 

we get 

Since U and V are unitary operators, so it is UV*. Hence, the overlap operator is a shifted unitary operator. Iterative inverters 
for such operators can be simplified as we show below. Before doing this we give an iterative method in order to compute the 
overlap operator. 

Lanczos Algorithm for the Overlap Computation 

In order to apply the overlap operator to a vector b one should first perform the inversion (D^Dw Y^^x = b and then apply Dw 
to X. The calculation is based on the following integral representation for the inverse square root [13]: 

{D*^Dw)-^'^ = - / dt{t^ + D*wDw)-' . 

Jo 

From the previous section we know that one can use the Lanczos algorithm to solve the linear systems, such as 

{D;^Dw)-'b = QkT-^eip, (4.1) 
where p = \\b\\. Since by shifting the matrix DI^Dw one obtains the same Lanczos vectors, we can write 

(t^ + D*wDw)-^b = Qkit" + Tk)-^eip . (4.2) 
Using the above integral representation again, but now for Lanczos matrix Tk we get 

X = {D*y^Dw)-^'% = QkT^^'^eip . (4.3) 
To sunmiarise, in order to find x one computes: 

• Qk and T/j using the Lanczos function on D^Dw and b, 

— 1/2 

• then computes y/j = Tj(. ' eip, 

• and finally x = QkVk- 

Using the A3 , b pair of the last example one can enter these commands: 

b=rand(2*N2,l); 
rho=norm(b ) ; 

[Q,T] = Lanczos((A3-eye(5 12)) ' * (A3-eye (5 1 2)) ,b,200); 

[N,k]=size (Q); 

el = zeros(k-l ,1); el(l) = l; 

y=sqrtm(T)\el*rho ; 

x=(A3-eye(512))*Q(: ,l:k-l)*y; 

In case Qk vectors are too large to be stored in the main memory of the computer, one can modify Lanczos such that it does 
not accumulate Lanczos vectors into Q. In this case, one calculates yk and then repeats the Lanczos iteration in order to form x. 
This is the so called double pass algorithm. 

For small problems, as it is usually the case for QED2 on the lattice, one can compute the overlap operator using the singular 
value decomposition: 
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[U.W, S_W , V_W] = svd ( A3-ey e ( 5 1 2 ) ) ; 
V=U.W*V_W' ; 
z=V*b ; 
norm( z x) 
D=eye(512)+V; 
eigD=eig (D); 
plot (eigD , ' + ' ) 

In this case it is easy to compute D eigenvalues using direct methods, such as eig function. The following plot shows the 
eignevalues of the massless overlap operator D. 
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Inversion of the Overlap Operator 

Having computed D the next step is its inversion. Before discussing this, we note that D*D is optimally inverted using the 
GGNE algorithm, the reason being the normality of D, i.e. D*D = DD*. Hence, for dynamical fermion simulations, which 
require D*D inversions, CGNE is the prefered method. For propagator calculations the situation is less clear. This has to do 
with differing spectral properties of D*D and D. 

SUMR Algorithm 

We know that GMRES is the optimal method for non-Hermitian operators. We know also that GMRES requires very often 
prohibitive computer resources. However, for unitary matrices one can do better. Exploiting this fact, one can construct the 
SUMR or the Shifted Unitary Minimal Residual algorithm [14], which is charaterised by short-recurrences and at the same time 
benefit from the optimal properties of GMRES: 

[rr ,x] = Sl]MR(b ,V, rho , zeta , tol , imax ) ; 

Semiconjugate Gradients Algorithm 

We know that the POM is the counterpart of GMRES for the Galerkin approach to Unear system solvers. Likewise one can 
construct the counterpart of SUMR for shifted unitary systems. This can be done using a new Amoldi process for unitary matrices, 
the Arnoldi Unitary Process. In the coupled recurrences variant, the search directions of the algorithm are semiconjugate. 
Therefore, the algorithm is called the Semiconjugate Gradients (SCG) algorithm [15]: 

[x,rr] = SCG{A,b,xO, tol ,nmax); 
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Computing Projects 



In this section we compare CGNE, SUMR and SCG algorithms in the same background U(l) gauge field as before and the same 
Wilson-Dirac matrix A3: 

m=0.0001; 

D=(l+m)/2*eye(512) + (l-m)/2*V; 
b=zeros(512,l);b(l)=l; 

[x,rr] = CGNE(D,b,zeros(512,l),le-12,200); 
semilogy(l:2:2*max(size(rr)) , rr , ' -;CGNE; ') 
hold 

[x,rr] = SCG(D,b, zeros(512,l) ,le-12,200); 
semilogy(rr , ' .;SCG; ') 

[rr,x] = SUMR(b,V,(l+m)/2,(l -m)/2,le-12,200); 

semilogy (rr , ' : ;SUMR; ' ) 

xlabel("# matrix —vector ") 

ylabel (" residual norm") 

replot 

gset terminal postscript 
gset out "gw_conv_hist.ps" 
replot 

gset terminal xll 
hold 

The result of this comparison is shown in the following figure. One observes the optimal properties of SUMR, as expected. 
We note that SCG is doing worse at the begining until it reaches the asymtotic regime of SUMR. Both SUMR and SCG are 25% 
faster then CGNE. 
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5 QCDLAB 2.x 

In the previous sections we described version 1 .0 of QCDLAB. Its simulation functionaUty is Umited to the QED2 on the lattice, 
a very good laboratory for algorithmic and ideas exploration in lattice QCD. 
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We plan to extend functionality of QCDLAB 1 .0 further. We are in the designing phase of QCDLAB 2.0 which will be totaly 
devoted to lattice QCD simulation. The gross features of this future version are expected to be: 

• Dynamically linked functions to already exsisting procedures in other languages. 

• Ability to compile MATLAB/OCTAVE codes using the Startego Octave Compiler, Octave-Compiler . org. 

• Scalability on various computing platforms using the above compiler 

• Extended functionality, in particular simulation functions for lattice QCD and matrix- vector multiplication procedures for 
various fermion operators. 

References 

[1] K. G. Wilson, in New phenomena in subnuclear Physics, ed. A. Zichichi (Plenum, New York, 1977 
[2] J. Kogut and L. Susskind, Phys. Rev. D 1 1 (1974) 395 

[3] S. Duane, A. D. Kennedy, B. J. Pendleton and D. Roweth, Phys. Lett. B195 (1987) 216 

[4] N. A. Metropolis, N. M. Rosenbluth, A. H. Rosenbluth, E. Teller and J. Teller, J. Chem. Phys. 21 (1953) 1087 

[5] For a review of these methods in lattice QCD see A. Borigi, Krylov Subspace Methods in Lattice QCD, PhD thesis, CSCS 
TR-96-27, ETH Zurich 1996 

[6] M. R. Hestenes and E. Stiefel, J. Res. Nat. Bur. Stand. 49 (1952) 409 

[7] W. E. Arnoldi, Quart. Appl. Math. 9 (1951) 17 

[8] Y. Saad and M. H. Schultz, SIAM J. Sci. Stat. Comput. 7 (1986) 856 

[9] H. A. Van der Vorst, SIAM J. Sci. Stat. Comput., 13 (1992) 631 
[10] C. Lanczos, J. Res. Nat. Bur. Stand. 49 (1952) 33 
[11] P H. Ginsparg and K. G. Wilson, Phys. Rev. D 25 (1982) 2649. 
[12] H. Neuberger, Phys. Lett. B 417 (1998) 141 
[13] A. Borigi, J. Comput. Phys. 162 (2000) 123-131 

[14] C. R Jagels und L. Reichel, Num. Lin. Algeb. Appl., Vol. 1(6), 555-570 (1994) 
[15] A. Bori§i, in preparation. 



24 



